{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Atividades\n",
    "\n",
    "1. Utilizamos a medida de Entropia como fator de decisão (medida de impureza de um nó). Teste o mesmo conjunto \n",
    "randômico de dados para a medida Gini e compare os resultados.\n",
    "Ref1.: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier\n",
    "Ref2.: https://en.wikipedia.org/wiki/Decision_tree_learning\n",
    "\n",
    "2. Faça o balanceamento dos dados contidos em \"train.csv\", aplique o algoritmo de Decision Tree e faça a submissão no kaggle. Tente melhorar o resultado obtido em sala de aula (posição 3100 no leaderboard).\n",
    "Dataset: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction\n",
    "\n",
    "3. (Opcional) Execute uma Random Forest na competição do Kaggle e veja se a acurácia melhora. Utilize 10, 100 ou 1000 árvores (dependendo de quanto o seu computador aguentar =]): http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Carregando os Dados"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/minhotmog/anaconda3/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
      "  \"This module will be removed in 0.20.\", DeprecationWarning)\n"
     ]
    }
   ],
   "source": [
    "# Bibliotecas\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.cross_validation import train_test_split\n",
    "from sklearn.metrics import classification_report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>buying</th>\n",
       "      <th>maint</th>\n",
       "      <th>doors</th>\n",
       "      <th>persons</th>\n",
       "      <th>lug_boot</th>\n",
       "      <th>safety</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   buying  maint  doors  persons  lug_boot  safety  class\n",
       "0       1      1      1        2         2       0      1\n",
       "1       0      1      0        2         2       2      2\n",
       "2       2      2      0        2         2       2      2\n",
       "3       2      1      3        0         2       2      2\n",
       "4       0      0      3        1         0       1      2"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Carregando os dados\n",
    "headers = [\"buying\", \"maint\", \"doors\", \"persons\",\"lug_boot\", \"safety\", \"class\"]\n",
    "data = pd.read_csv(\"car_data.csv\", header=None, names=headers)\n",
    "\n",
    "# Embaralhando os dados\n",
    "data = data.sample(frac=1).reset_index(drop=True)\n",
    "\n",
    "# Categoriza todos os atributos para facilitar a Classificação\n",
    "for h in headers:\n",
    "    data[h] = data[h].astype('category')\n",
    "    data[h] = data[h].cat.codes\n",
    "\n",
    "# Exibição dos Dados\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Separação dos Dados em Treino e Teste\n",
    "X = data.iloc[:,:-1].values; y = data.iloc[:,-1].values;\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1ª Questão"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Definição do Classificador da Árvore de Decisão\n",
    "# 1 -> Entropia; 2 -> Gini\n",
    "dTree1 = DecisionTreeClassifier(criterion=\"entropy\")\n",
    "dTree2 = DecisionTreeClassifier(criterion=\"gini\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,\n",
       "            max_features=None, max_leaf_nodes=None,\n",
       "            min_impurity_split=1e-07, min_samples_leaf=1,\n",
       "            min_samples_split=2, min_weight_fraction_leaf=0.0,\n",
       "            presort=False, random_state=None, splitter='best')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# É feito o fitting dos dois modelos anteriores utilizando os dados de treino\n",
    "dTree1.fit(X_train, y_train)\n",
    "dTree2.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "#### Modelo de Entropia ####\n",
      "Decision Tree Score: 0.971\n",
      "\n",
      "Classification Report:\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "          0       0.95      0.93      0.94       116\n",
      "          1       0.85      0.88      0.87        26\n",
      "          2       0.99      1.00      0.99       353\n",
      "          3       0.91      0.88      0.89        24\n",
      "\n",
      "avg / total       0.97      0.97      0.97       519\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Avaliação de Performance do Modelo 1\n",
    "print(\"#### Modelo de Entropia ####\")\n",
    "print(\"Decision Tree Score: {0:.3}\".format(dTree1.score(X_test, y_test)))\n",
    "\n",
    "print(\"\\nClassification Report:\")\n",
    "yPreds = dTree1.predict(X_test)\n",
    "print(classification_report(y_test, yPreds))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "#### Modelo de Gini ####\n",
      "Decision Tree Score: 0.971\n",
      "\n",
      "Classification Report:\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "          0       0.95      0.93      0.94       116\n",
      "          1       0.92      0.88      0.90        26\n",
      "          2       0.99      1.00      0.99       353\n",
      "          3       0.91      0.88      0.89        24\n",
      "\n",
      "avg / total       0.97      0.97      0.97       519\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Avaliação de Performance do Modelo 1\n",
    "print(\"#### Modelo de Gini ####\")\n",
    "print(\"Decision Tree Score: {0:.3}\".format(dTree2.score(X_test, y_test)))\n",
    "\n",
    "print(\"\\nClassification Report:\")\n",
    "yPreds = dTree2.predict(X_test)\n",
    "print(classification_report(y_test, yPreds))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2ª Questão"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>target</th>\n",
       "      <th>ps_ind_01</th>\n",
       "      <th>ps_ind_02_cat</th>\n",
       "      <th>ps_ind_03</th>\n",
       "      <th>ps_ind_04_cat</th>\n",
       "      <th>ps_ind_05_cat</th>\n",
       "      <th>ps_ind_06_bin</th>\n",
       "      <th>ps_ind_07_bin</th>\n",
       "      <th>ps_ind_08_bin</th>\n",
       "      <th>ps_ind_09_bin</th>\n",
       "      <th>...</th>\n",
       "      <th>ps_calc_11</th>\n",
       "      <th>ps_calc_12</th>\n",
       "      <th>ps_calc_13</th>\n",
       "      <th>ps_calc_14</th>\n",
       "      <th>ps_calc_15_bin</th>\n",
       "      <th>ps_calc_16_bin</th>\n",
       "      <th>ps_calc_17_bin</th>\n",
       "      <th>ps_calc_18_bin</th>\n",
       "      <th>ps_calc_19_bin</th>\n",
       "      <th>ps_calc_20_bin</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>11</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>6</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>9</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 58 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   target  ps_ind_01  ps_ind_02_cat  ps_ind_03  ps_ind_04_cat  ps_ind_05_cat  \\\n",
       "0       0          0              1          5              0              0   \n",
       "1       0          0              2          2              0              0   \n",
       "2       0          5              1          3              0              0   \n",
       "3       0          0              1          2              0              0   \n",
       "4       0          0              1          7              0              0   \n",
       "\n",
       "   ps_ind_06_bin  ps_ind_07_bin  ps_ind_08_bin  ps_ind_09_bin       ...        \\\n",
       "0              1              0              0              0       ...         \n",
       "1              1              0              0              0       ...         \n",
       "2              0              0              0              1       ...         \n",
       "3              0              1              0              0       ...         \n",
       "4              0              0              0              1       ...         \n",
       "\n",
       "   ps_calc_11  ps_calc_12  ps_calc_13  ps_calc_14  ps_calc_15_bin  \\\n",
       "0           5           4           2           5               0   \n",
       "1           3           1           3          11               0   \n",
       "2           6           3           3           5               0   \n",
       "3           3           1           4           7               0   \n",
       "4           9           4           4           8               0   \n",
       "\n",
       "   ps_calc_16_bin  ps_calc_17_bin  ps_calc_18_bin  ps_calc_19_bin  \\\n",
       "0               0               0               0               1   \n",
       "1               1               1               0               0   \n",
       "2               0               1               0               0   \n",
       "3               1               0               0               0   \n",
       "4               1               0               0               1   \n",
       "\n",
       "   ps_calc_20_bin  \n",
       "0               0  \n",
       "1               0  \n",
       "2               0  \n",
       "3               1  \n",
       "4               0  \n",
       "\n",
       "[5 rows x 58 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Carregando os dados\n",
    "data = pd.read_csv(\"train.csv\")\n",
    "\n",
    "# Embaralhando os dados\n",
    "data = data.drop([\"id\"], axis=1)\n",
    "data = data.sample(frac=1).reset_index(drop=True)\n",
    "\n",
    "# Exibição dos Dados\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Separação dos Dados em Treino e Teste\n",
    "X = data.iloc[:,1:].values; y = data.iloc[:,0].values;\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,\n",
       "            max_features=None, max_leaf_nodes=None,\n",
       "            min_impurity_split=1e-07, min_samples_leaf=1,\n",
       "            min_samples_split=2, min_weight_fraction_leaf=0.0,\n",
       "            presort=False, random_state=None, splitter='best')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Definição do Classificador da Árvore de Decisão\n",
    "dTree = DecisionTreeClassifier(criterion=\"gini\")\n",
    "dTree.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "#### Modelo de Gini ####\n",
      "Decision Tree Score: 0.918\n",
      "\n",
      "Classification Report:\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "          0       0.96      0.95      0.96    114606\n",
      "          1       0.05      0.06      0.05      4437\n",
      "\n",
      "avg / total       0.93      0.92      0.92    119043\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Avaliação de Performance utilizando Conjunto de Teste\n",
    "yPred = dTree.predict(X_test)\n",
    "print(\"#### Modelo de Gini ####\")\n",
    "print(\"Decision Tree Score: {0:.3}\".format(dTree.score(X_test, y_test)))\n",
    "\n",
    "print(\"\\nClassification Report:\")\n",
    "print(classification_report(y_test, yPred))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Preparando submissão para o Kaggle\n",
    "# Carregando os dados\n",
    "data = pd.read_csv(\"test.csv\")\n",
    "yPred = dTree.predict(data.values[:,1:])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Salvando o .csv da Submissão\n",
    "data['target'] = yPred\n",
    "data[['id', 'target']].to_csv('submission.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
